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ABSTRACT 

The truncated Fourier transform (TFT) was introduced by 
van der Hoeven in 2004 as a means of smoothing the "jumps" 
in running time of the ordinary FFT algorithm that occur at 
power-of-two input sizes. However, the TFT still introduces 
these jumps in memory usage. We describe in-place vari- 
ants of the forward and inverse TFT algorithms, achieving 
time complexity 0(n log n) with only 0(1) auxiliary space. 
As an application, we extend the second author's results 
on space-restricted FFT-based polynomial multiplication to 
polynomials of arbitrary degree. 

Categories and Subject Descriptors 

F.2.1 [Analysis of Algorithms and Problem Complex- 
ity]: Numerical Algorithms and Problems — Computations 
on polynomials; G.4 [Mathematical Software]: Algorithm 
design and analysis. Efficiency; 1.1.2 [SymboHc and Alge- 
braic Manipulation]: Algorithms — Algebraic algorithms, 
Analysis of algorithms 



has since become one of the most important and useful tools 
in computer science, especially in the area of signal process- 
ing. 

The FFT algorithm is also important in computer algebra, 
most notably in asymptotically fast methods for integer and 
polynomial multiplication. The first integer multiplication 
algorithm to run in softly linear time relies on the FFT [lO] , 
as do the recent theoretical improvement and the best 
result for polynomial multiplication over arbitrary algebras 
1^ . Moreover, numerous other operations on polynomials — 
including division, evaluation/interpolation, and GCD com- 
putation — have been reduced to multiplication, so more 
efficient multiplication methods have an indirect effect on 
many areas in computer algebra [sj §8-11]. 

The simplest FFT to implement, and the fastest in prac- 
tice, is the radix-2 Cooley-Tukey FFT. Because the radix-2 
FFT requires the size to be a power of two, the simplest so- 
lution for all other sizes is to pad the input polynomials with 
zeros, resulting in large unwanted "jumps" in the complexity 
at powers of two. 



General Terms 

Algorithms, Performance, Theory 

Keywords 

Truncated Fourier transform, fast Fourier transform, poly- 
nomial multiplication, in-place algorithms 

1. INTRODUCTION 
1.1 Background 

The discrete Fourier transform (DFT) is a linear map that 
evaluates a given polynomial at powers of a root of unity. 
Cooley and Tukey [s] were the first to develop an efficient 
method to compute this transform on a digital computer, 
known as the Fast Fourier Transform (FFT). This algorithm 
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1.2 The truncated Fourier transform 

It has been known for some time that if only a subset of 
the DFT output is needed, then the FFT can be truncated 
or "pruned" to reduce the complexity, essentially by disre- 
garding those parts of the computation tree not contributing 
to the desired outputs [sj [Tl] . More recently, van der Ho- 
even took the crucial step of showing how to invert this pro- 
cess, describing a truncated Fourier transform (TFT) and 
an inverse truncated Fourier transform (ITFT), and show- 
ing that this leads to a polynomial multiplication algorithm 
whose running time varies relatively smoothly in the input 
size [12] [13] . 

Specifically, given an input vector of length n < 2*^, the 
TFT computes the first n coefficients of the ordinary Fourier 
transform of length 2*°, and the ITFT computes the inverse 
of this map. The running time of these algorithms smoothly 
interpolates the 0(n log n) complexity of the standard radix- 
2 Cooley-Tukey FFT algorithm. One can therefore deduce 
an asymptotically fast polynomial multiplication algorithm 
that avoids the characteristic "jumps" in running time ex- 
hibited by traditional FFT-based polynomial multiplication 
algorithms when the output degree crosses a power-of-two 
boundary. This observation has been confirmed with prac- 
tical implementations |13[ |7] |6] , with the most marked im- 
provements in the multivariate case. 

One drawback of van der Hoeven's algorithms is that while 



their time complexity varies smoothly with n, their space 
complexity does not. Both the TFT and ITFT operate in a 
buffer of length 2'^'^"^ ; that is, for inputs of length n, they 
require auxiliary storage of 2^'^"^ — n + 0(1) cells to store 
intermediate results, which can be Q{n) in the worst case. 

1.3 Summary of results 

The main results of this paper are TFT and ITFT algo- 
rithms that require only 0(1) auxiliary space, while respect- 
ing the 0(n log n) time bound. 

The new algorithms have their origin in a cache-friendly 
variant of the TFT and ITFT given by the first author [6], 
that builds on Bailey's cache-friendly adaptation of the or- 
dinary FFT [1]. If the transform takes place in a buffer of 
length i = 2 , these algorithms decompose the transform 
into Li — 2^^ row transforms of length L2 ~ 2^^ and L2 
column transforms of length Li, where £1 + £2 ~ Van 
der Hoeven's algorithms correspond to the case Li = 2 and 
1/2 = L/2. To achieve optimal locality, [g] suggests tak- 
ing Li ~ \fL {£i ~ ^/2). In fact, in this case one already 
obtains TFT and ITFT algorithms needing only 0{^/n) aux- 
iliary space. At the other extreme we may take Li = L/2 
and 1/2 = 2, obtaining TFT and ITFT algorithms that use 
only 0(1) space at each recursion level, or O(logn) auxil- 
iary space altogether. In signal processing language, these 
may be regarded as decimation-in-time variants of van der 
Hoeven's decimation-in-frequency algorithms. 

Due to data dependencies in the 0(log n)-space algorithms 
sketched above, the space usage cannot be reduced further 
by simply reordering the arithmetic operations. In this pa- 
per, we show that with a little extra work, increasing the 
implied constant in the 0(n log n) running time bound, it is 
possible to reduce the auxiliary space to only 0(1). To make 
the 0(1) space bound totally explicit, we present our TFT 
and ITFT algorithms (Algorithms [T] and [2| in an iterative 
fashion, with no recursion. Since we do not have space to 
store all the necessary roots of unity, we explicitly include 
steps to compute them on the fly; this is non-trivial because 
the decimation-in-time approach requires indexing the roots 
in bit-reversed order. 

As an application, we generalize the second author's space- 
restricted polynomial multiplication algorithm [9]. Consider 
a model in which the input polynomials are considered read- 
only, but the output buffer may be read from and written 
to multiple times. The second author showed that in such 
a model, it is possible to multiply polynomials of degree 
n = 2'= - 1 in time 0(n log n) using only 0(1) auxiliary 
space. Using the new in-place ITFT, we generalize this re- 
sult to polynomials of arbitrary degree. 

2. PRELIMINARIES 
2.1 Computational model 

We work over a ring R containing 2*'-th roots of unity for 
all (suitably large) k, and in which 2 is not a zero-divisor. 

Our memory model is similar to that used in the study of 
in-place algorithms for sorting and geometric problems, com- 
bined with the well-studied notion of algebraic complexity. 
Specifically, we allow two primitive types in memory: ring 
elements and pointers. A ring element is any single element 
of R, and the input to any algorithm will consist of n such 
elements stored in an array. A pointer can hold a single 



integer a G Z in the range ~cn < a < cn for some fixed 
constant c £ N. (In our algorithms, we could take c = 2.) 

We say an algorithm is in-place if it overwrites its input 
buffer with the output. In this case, any element in this 
single input/output array may be read from or written to 
in constant time. Our in-place truncated Fourier transform 
algorithms (Algorithms [l] and [2| fall under this model. 

An out-of-place algorithm uses separate memory locations 
for input and output. Here, any element from the input ar- 
ray may be read from in constant time (but not overwritten), 
and any element in the output array may be read from or 
written to in constant time as well. This will be the situation 
in our multiplication algorithm (Algorithm [3|. 

The algorithms also need to store some number of pointers 
and ring elements not in the input or output arrays, which 
we define to be the auxiliary storage used by the algorithm. 
All the algorithms we present will use only 0(1) auxiliary 
storage space. 

This model should correspond well with practice, at least 
when the computations are performed in main memory and 
the ring R is finite. 

2.2 DFT notation 

We denote by u^k] a primitive 2'°-th root unity, and we 
assume that these are chosen compatibly, so that (^fk+i] ~ 
ijj[k] for all fc > 0. Define a sequence of roots oJOjOJi, ... by 
ujs = ^[k]'' " ' where k > [lg(s + 1)] and rev^ s denotes the 
length-fc bit-reversal of s. Thus we have 

^0 ~ ^[0] {~ 1) = l^[2] ^4 = l^[3] '^6 = l^fs] 

LUl = {= -1) OJs = Ujf2] '^i = ^[3] ^7 = ^^^3] 

and so on. Note that 

1^2s + l = —l^2a and = i^2s + l = ^s. 

If -F G R[x] is a polynomial with deg F < n, we write 
Fs for the coefficient of x" in F, and we define the Fourier 
transform F by 

F, = F{lj,). 

In Algorithms [1] and [2] below, we decompose F as 

F(x) = G{x^) +xH{x'^), 

where degG < \n/2] and deg_ff < [n/2j. Using the prop- 
erties of ujs mentioned above, we obtain the "butterfly" re- 
lations 

F2s — Gs + UJ2sF[s, 
F2S + I ~ Gs — l-02aHa. 

Both the TFT and ITFT algorithm require, at each recur- 
sive level, iterating through a set of index-root pairs such 
as {(i,a;i),0 < i < n}. A traditional, time-efficient ap- 
proach would be to precompute all powers of store 
them in reverted-binary order, and then pass through this 
array with a single pointer. However, this is impossible un- 
der the restriction that no auxiliary storage space be used. 
Instead, we will compute the roots on-the-fly by iterating 
through the powers of ajf^j in order, and through the indices 
i in bit-reversed order. Observe that incrementing an in- 
teger counter through rev^ 0, revfc 1, rev^ 2, . . . can be done 
in exactly the same way as incrementing through 0, 1, 2, . . ., 
which is possible in-place and in amortized constant time. 



(0,0) = {0,1, 2, 3, 4, 5} 




Figure 1: TFT tree for n = 6 



3. SPACE-RESTRICTED TFT 

In this section we describe an in-place TFT algorithm that 
uses only 0(1) auxiliary space (Algorithm [T|. The routine 
operates on a buffer Xo, ■ ■ ■ , X„-i containing elements of R. 
It takes as input a root of unity of sufficiently high order and 
the coefficients Fo, . . . , of a polynomial F £ R[x], and 

overwrites these with Fo, • ■ • , Jn-i- 

The pattern of the algorithm is recursive, but we avoid 
recursion by explicitly moving through the recursion tree, 
avoiding unnecessary space usage. An example tree for n — 
6 is shown in Figure [l] The node S — {q, r) represents the 
subarray with offset q and stride 2''; the ith element in this 
subarray is Si — Xq+i.2'-, and the length of the subarray is 
given by 



len(5') = p 



The root is (0,0), corresponding to the entire input array 
of length n. Each subarray of length 1 corresponds to a leaf 
node, and we define the predicate IsLeaf(S') to be true iff 
len(S') = 1. Each non-leaf node splits into even and odd 
child nodes. To facilitate the path through the tree, we 
define 

Even(g, r) = (g, r + 1), 
Odd(g,r) = (g + 2",r + 1) 

if {q, r) is not a leaf, 

Parent(,r)-|(^'"-^)' < ^-^S 

Farent(g, rj - 2'-i,r - 1), g > 2^-' 

if {q, r) is not the root, and for any node we define 

LeftmostLeaf(S) = |^' , , l^l^^^m, 

1 LeftmostLeaf (Even(S')), otherwise. 

We begin with the following lemma. 

Lemma 3.1. Let N he a node with len(7V) = I, and let 

A{x) = ^ A,x' e R[x\. 

Q<i<l 

If S ^ LeftmostLeaf (Af) and Nt ^ Ai for < i < £ before 
some iteration of line [5| in Algorithm [7] then after a fimte 
number of steps, we will have S = N and Ni — Ai for 
< i < £, before the execution of line No other array 
entries in X are affected. 



Algorithm 1: InplaceTFT([A'o, . . .,X„-i]) 

Input: A"i = Fi for < j < n, where F £ R[x], 

deg F < n 
Output: Xi ^ Fi for < i < n 



S LeftmostLeaf (0, 0) 
prev null 
while true do 

m len(5) 
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9 
10 

11 
12 
13 
14 

15 
16 



if IsLeaf(5') or prev = Odd(5') then 

for (i, 61) G {(j, u}2j) : < j < [m/2j } do 

S2i ^ S2i + 0S2i+l 
S2i+1 S2i — dS2i+l 

if S = (0, 0) then halt 

prev S 
S ^ Parent (S) 

else if prev — Even(S') then 
if len(S) = 1 mod 2 then 

V ^ Y,^=0^^^'^ S2i + 1 ■ (t^{m-l)/2) 
Sm~l <~ Sm-1 + V ■ CJm-l 

prev S 

S ^ LeftmostLeaf (Odd(5')) 



Proof. The proof is by induction on ^. If ^ = 1, then 
IsLeaf(A'^) is true and Aq = Aq so we are done. So assume 
i > 1 and that the lemma holds for all shorter lengths. 

Decompose A as A{x) — G{x'^) + xH{x^). Since S = 
LeftmostLeaf (Even(A'^)) as well, the induction hypothesis 
guarantees that the even-indexed elements of N, correspond- 
ing to the coefficients of G, will be transformed into G, and 
we will have S = Even(A') before linejs] The following lines 
set prev = Even(A) and S — N, so that lines |12|jl6| are 
executed on the next iteration. 

If £ is odd, then (£-l)/2 > len(Odd(A')), so H(e^i)/2 will 
not be computed in the odd subtree, and we will not be able 
to apply ^ to compute Ae-i = G'(^_i)/2 + uje-iH(e-i)/2. 
This is why, in this case, we explicitly compute 



{e-i)/2) 



(e-i)/2 



on line 



13 



14 



before 



and then compute Ai-i directly on line ' 
descending into the odd subtree. 

Another application of the induction hypothesis guaran- 



tees that we will return to line [s] with S — Odd(A'^) after 
computing N2i+i = Hi for < i < [^/2J. The following 
lines set prev = Odd(A'^) and S — N, and we arrive at linejo] 
on the next iteration. The for loop thus properly applies 
the butterfly relations |TJ to compute Ai for < i < 2[£/2\ , 
which completes the proof. □ 

Now we are ready for the main result of this section. 

Proposition 3.2. Algorithm^^ correctly computes Fi for 
< i < n. It performs 0(n log n) ring and pointer opera- 
tions, and uses 0(1) auxiliary space. 

Proof. The correctness follows immediately from Lemma 
3.1 since we start with S — LeftmostLeaf (0, 0), which is the 
first leaf of the whole tree. The space bound is immediate 
since each variable has constant size. 

To verify the time bound, notice that the while loop vis- 
its each leaf node once and each non-leaf node twice (once 
with prev — Even(S') and once with prev = Odd(5')). Since 
always q < 2^ < 2n, there are 0{n) iterations through the 
while loop, each of which has cost C'(len(S') + logn). This 
gives the total cost of O(nlogn). □ 

4. SPACE-RESTRICTED ITFT 

Next we describe an in-place inverse TFT algorithm that 
uses 0(1) auxiliary space (Algorithm [2| . It takes as input 
Fo, ■ ■ ■ , Fn-i for some polynomial F £ R[x], degF < n, and 
overwrites the buffer with _Fo, ■ ■ • , ^n-i- 

The path of the algorithm is exactly the reverse of Algo- 
rithm [l] and we use the same notation as before to move 
through the tree. We only require one additional function: 



RightmostParent(S) — 

I RightmostParent(Parent(S')), 



S = Odd(Parent(5')), 
otherwise. 



If 



then 



LeftmostLeaf(Odd(iVi)) = N2 



Parent (RightmostParent(iV2)) = iVi, 

so RightmostParent computes the inverse of the assignment 
on line [16] in Algorithm [T] 

We leave it to the reader to confirm that the structure 
of the recursion is identical to that of Algorithm [l] but in 
reverse, from which the following analogues of Lemma |3.1| 
and Proposition |3. 2| follow immediately: 

Lemma 4.1. Let N be a node with len(A'') = £, and 

J2 A,x' e R\x\. 
o<i<e 



Aix) 



If S = N and Ni = Ai for < i < £ before some iteration of 
line^in Algorithm^ then after a finite number of steps, we 
will have S — LeftmostLeaf (A'^) and Ni = Ai for < i < £ 
before some iteration of line^ No other array entries m X 
are affected. 

Proposition 4.2. Algorithm^ correctly computes Fi for 
< i < n. It performs 0{n\ogn) ring and pointer opera- 
tions, and uses 0(1) auxiliary space. 

The fact that our InplaceTFT and InplacelTFT algo- 
rithms are essentially reverses of each other is an interesting 
feature not shared by the original formulations in [12]. 



Algorithm 2: InplacelTFT ([Ao, . . . , A„_i]) 

Input: Xi = Fi for < i < n, where F £ R[x], 

deg F < n 
Output: Xi = Fi for < i < n 

15"^ (0,0) 

2 while 5" / LeftmostLeaf (0, 0) do 

3 if IsLeaf(5') then 
S -(r- Parent(RightmostParent(S')) 
m <— len(5') 

if len(S) = 1 mod 2 then 

•sr-^.(m~3) /2 o i 
" ^ 1^1=0 -S^ai + l ■ ^(n,-l)/2 

Sm-1 Sm~l — V ■ iOm-1 



9 

10 
11 
12 

13 
14 



S ^ Even(S') 

else 

m len(S') 

for (1,9) e : < j < [m/2\} do 



S2i 
S2i+1 

S ^ Odd(S) 



(5*21 + S2i+l)/2 
' ■ {S2i — S2i+l)/2 



5. POLYNOMIAL MULTIPLICATION 

We now describe the multiplication algorithm alluded to 
in the introduction. The strategy is similar to that of [9], 
with a slightly more complicated "folding" step. The input 
consists of two polynomials A,B£ R[x] with deg A < n and 
degiJ < m. The routine is supplied an output buffer X of 
length r = n-fm — 1 in which to write the product C = AB. 

The subroutine FFT has the same interface as InplaceTFT, 
but is only called for power-of-two length inputs. 



Algorithm 3: Space-restricted product 
Input: A,B £ R[x], deg A < m, degB < n 
Output: Xs = Cs for 0<s<n + m — 1, where 
C = AB 

1 r<— n + m— 1 

2 g ^ 

3 while g < r — 1 do 

£ ^ Mr -q)]-l 
L^2' 



4 
5 
6 
7 
8 

9 
10 
11 



[Xq,Xq + l, . . . ,Xq + 2L-l] <~ [0, 0, . . . , 0] 

for < z < m do 

[_ Ag-|_(i mod L) ^ ^q+{i mod L) 

FFT([A„A,+i,...,A,+i-i]) 
for < i < n do 

I Aq + /, + (i mod L) Xa 



q + L + {i mod L) + ijJqBi 

12 FFT([Ag+_L, Aq+L+1, . . . , Aq+2i:,-i]) 

13 for < i < L do 

14 ^ XqJ^i Xq+iXq + L-^i 

15 g g -I- L 

16 Xr-l A{u]r-l)B{uir-l) 

17 InplacelTFT ([Ao, . . . , A^-i]) 



Proposition 5.1. Algorithm^correctly computes the prod- 
uct C = AB, in time 0{{m -\- n) log(m + n)) and using 0(1) 



auxiliary space. 

Proof. The main loop terminates since q is strictly in- 
creasing. Let A'' be the number of iterations, and let qo > 
qi > ■ ■ ■ > gjv-i and Lq > Li > ■ ■ • > Lm~i be the values 
of q and L on each iteration. By construction, the intervals 
[qi,qi + Li) form a partition of [0,r — 1), and Li is the largest 
power of two such that qi + 2Li < r. Therefore each L can 
appear at most twice (i.e. if Li = Li-i then Li+i < Li), 
A'' < 2 Igr, and we have Li | qi for each i. 

At each iteration, lines (THsI compute the coefficients of 
the polynomial A{ujqx) mod a:: — 1, placing the result in 
[Xq, . . . , Xq+i-i]. Line |9] then computes X^+i — A{ijjqU)i) 
for < i < L. Since Lj g we have uiqUJi = uiq+i, and so 
we have actually computed Xq+i — Aq+i for < i < L. 
The next two lines similarly compute Jfg+^+i = I3q+i for 
< i < L. (The point of the condition q+2L < r is to ensure 
that both of these transforms fit into the output buffer.) 
Lines 13 14 then compute X, 



0<i < L. 
After line 



q + i 



Aq-^iBq^i 



Cq+i for 



16 



we finally have Xg — d for all < i < r. 
(The last product was handled separately since the output 
buffer does not have room for the two Fourier coefficients.) 
Line 1 17| then recovers Co, ■ ■ ■ , Cr-i- 

We now analyze the time and space complexity. The loops 
on lines [6] [t] |10| and 13 contribute 0(r) operations per it- 
eration, or 0(r log r) in total, since A'^ — O(logr). The 
FFT calls contribute 0(Li log Li) per iteration, for a to- 
tal of O(E,i.logL0 = O(E^aogL) = 0(r log r). Line 
16 contribute 0{r), and line 17 contributes 0(r log r) by 



The space requirements are immediate also 
since the main loop requires only 0(1) 



4.2 



Proposition |4.2 
by Proposition 
space. □ 



6. CONCLUSION 

We have demonstrated that forward and inverse radix- 
2 truncated Fourier transforms can be computed in-place 
using 0(71 log n) time and 0(1) auxiliary storage. As a re- 
sult, polynomials with degrees less than n can be multiplied 
out-of-place within the same time and space bounds. These 
results apply to any size n, whenever the underlying ring 
admits division by 2 and a primitive root of unity of order 

2 rig "1 

Numerous questions remain open in this direction. First, 
our in-place TFT and ITFT algorithms avoid using auxiliary 
space at the cost of some extra arithmetic. So although 
the asymptotic complexity is still 0(n log n), the implied 
constant will be greater than for the usual TFT or FFT 
algorithms. It would be interesting to know whether this 
extra cost is unavoidable. In any case, the implied constant 
would need to be reduced as much as possible for the in- 
place TFT/ITFT to compete with the running time of the 
original algorithms. 

We also have not yet demonstrated an in-place multi- 
dimensional TFT or ITFT algorithm. In one dimension, 
the ordinary TFT can hope to gain at most a factor of two 
over the FFT, but a d-dimensional TFT can be faster than 
the corresponding FFT by a factor of 2**, as demonstrated in 
[7] . An in-place variant along the lines of the algorithms pre- 
sented in this paper could save a factor of 2'* in both time 
and memory, with practical consequences for multivariate 
polynomial arithmetic. 

Finally, noticing that our multiplication algorithm, de- 



spite using only 0(1) auxiliary storage, is still an out-of- 
place algorithm, we restate an open question of [9]: Is it 
possible, under any time restrictions, to perform multipli- 
cation in-place and using only 0(1) auxiliary storage? The 
answer seems to be no, but a proof is as yet elusive. 
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